Figure 1: Our system is trained exclusively on synthetic data obtained from our scene library, SynthCam3D. During testing,
per-frame predictions returned by the network are fused using the camera poses provided by the reconstruction system.

Abstract

We are interested in automatic scene understanding from
geometric cues. To this end, we aim to bring semantic
segmentation into the loop of real-time reconstruction. Our
semantic segmentation is built on a deep autoencoder stack
trained exclusively on synthetic depth data generated from
our novel 3D scene library, SynthCam3D. Importantly, our
network is able to segment real-world scenes without any
noise modelling. We present encouraging preliminary results.


1. Introduction 

Fully automatic understanding of 3D scenes is of particular
interest for many applications that demand interaction with
objects and/or the primitive parts that make up the scene [?].
Such knowledge is indispensable for a robot to perform basic
interactions with its environment fully autonomously, such as
moving objects, clearing clutter, stacking objects on top of
others, or searching for objects in their likely locations.
These actions require a richer understanding of the scene than,
e.g., the per-image labels from image classification approaches
or the object bounding boxes provided by object detectors.

Figure 2: SynthCam3D is a library of synthetic indoor scenes collected from various online 3D repositories and hosted at
http://robotvault.bitbucket.org.

We believe that a key step towards whole scene understanding
is the semantic segmentation of the scene. Our work brings
together two established directions towards the goal of 3D
scene understanding: 3D reconstruction and deep learning-based
semantic segmentation. Here, we exploit the inherent dependency
between reconstruction and segmentation: per-frame labels are
fused using their respective camera poses returned by the
reconstruction system. In doing so, we particularly stress the
importance of treating the data as a video stream. On average,
segmentations from different viewpoints, when fused, should
yield a better result than a segmentation from any single view.
Our system is directly related to the work of Hermans et al. [8],
who fuse per-frame segmentations obtained with randomised
decision forests from RGB-D images; they use 2D and 3D dense
CRFs [9] to smooth the per-frame 2D segmentations and the fused
3D segmentation, respectively. We harness recent advances in
deep learning to obtain per-frame dense predictions. Our deep
architecture, inspired by [?], is composed of stacked
autoencoders and trained modularly. For all our experiments, we
use depth data as the only cue for 3D scene understanding. The
motivation for using depth images is twofold: firstly, depth
discontinuities are very important for object recognition, as
has been shown in [5], and secondly, depth data is convenient to
obtain. Using only depth cues spares us from the complications
of dealing with the infinite space of possible textures and
lighting setups, making it tractable to collect a representative
set of scenes in terms of scene layout and object distribution.
The challenge in this context is to investigate whether depth
data is a sufficient input for semantic segmentation.

We make publicly available a new library, SynthCam3D,
consisting of a significant number of labelled synthetic 3D
scenes and associated code for generating depth maps and their
corresponding annotations. The scenes belong to different
semantic categories and have been compiled from various online
3D repositories [?] and manually annotated. Large public
repositories of 3D CAD models (e.g. Trimble Warehouse) have
existed for some time, but they have mainly served the graphics
community. It is only recently that we have started to see
emerging interest in synthetic data for computer vision. The
advantages of synthetic 3D models cannot be overstated,
especially when considering scenes: once an annotated 3D model
is available, it allows rendering as many annotated 2D views as
desired, at any resolution and frame-rate. In comparison,
existing datasets of real data are fairly limited both in the
number of annotations and the amount of data. NYUv2 [14]
provides only 795 training images for 894 classes; hence
learning any meaningful features characterising a class of
objects becomes prohibitively hard. SynthCam3D is particularly
useful for:

• Generating potentially unlimited high-quality annotated
depth data for different types of scenes (Fig. 3).

• Benchmarking large-scale depth-only SLAM systems on complex
scenes, by providing ground truth geometry [?].

• Enabling the training of generative models, similar to e.g.
[?], to learn common scene layouts and object relationships,
which can then be used to synthesise more scenes effortlessly.

Figure 3: Samples of annotated images rendered at various camera poses for an office scene taken from SynthCam3D.

In the following, we describe SynthCam3D and briefly outline
our system trained using data generated from the library.
Preliminary results show the usefulness of the proposed library
for training deep architectures for semantic segmentation of
real-world scenes. With a careful choice of input features to
our deep learning network, and by using depth maps raycasted by
the reconstruction system, we are able to bypass the domain
adaptation issues that have been observed in the past, i.e. the
system trained on synthetic depth data can be directly applied
to segment real depth data, without the need for noise
modelling at training time.

2. SynthCam3D Library 


Category         Number of 3D models
Bedrooms         11
Office Scenes    15
Kitchens         11
Living Rooms     10
Bathrooms        10

Table 1: Different scene categories and the number of annotated
3D models for each category.

SynthCam3D contains 3D models from five different scene
categories: bedroom, office, kitchen, living room, and bathroom,
with at least 10 annotated scenes per category (Table 1).
Importantly, all the 3D models are in metric scale. Each scene
is composed of around 50-150 objects and the complexity can be
controlled algorithmically. The granularity of the annotations
can be adapted by the user depending on the application, e.g.
in our experiments on bedroom scenes we condensed the number of
classes down to 15 to generate data and to understand only
functional categories of objects. The models are provided in
.obj format, together with the code and camera settings needed
to set up the rendering using POV-Ray. A simple OpenGL-based
GUI allows the user to place virtual cameras in the synthetic
scene at desired locations to generate a possible trajectory
for rendering at different viewpoints. Fig. 3 shows samples of
rendered annotated views of a simple office scene.
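As a minimal sketch of how annotation granularity might be adapted, the snippet below collapses fine-grained object labels into coarser functional classes. The label names, the mapping, and the `condense_labels` helper are hypothetical illustrations; they are not taken from the SynthCam3D code, and the 15 classes used in the bedroom experiments are not reproduced here.

```python
# Hypothetical fine-grained labels and their functional classes.
# These names are illustrative only, not the library's real label set.
FUNCTIONAL_CLASS = {
    "office_chair": "chair",
    "stool": "chair",
    "desk": "table",
    "bedside_table": "table",
    "double_bed": "bed",
}


def condense_labels(annotation, mapping, default="other"):
    """Map every per-pixel fine label of a 2D annotation to a coarser class.

    `annotation` is a list of rows, each row a list of label strings.
    Labels missing from `mapping` fall back to `default`.
    """
    return [[mapping.get(label, default) for label in row] for row in annotation]


# A tiny 2x2 "annotation image" with fine-grained labels.
fine = [["office_chair", "desk"], ["stool", "lamp"]]
coarse = condense_labels(fine, FUNCTIONAL_CLASS)
print(coarse)
```

Keeping the mapping as plain data means the same rendered annotations could be reused at several granularities without re-rendering.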


3. Rendering Engine 

We use the popular ray-tracer POV-Ray for rendering, inspired
by the past work of Handa et al. [6]. To render depth maps with
associated annotations from the .obj models, we first convert
the .obj models to their corresponding POV-Ray files using
PoseRay¹. The camera extrinsic parameters are then set with a
3x4 matrix inside the main POV-Ray file (having the .pov
extension). A rendering trajectory can then be obtained by
varying the camera parameters inside the main POV-Ray file.
Each rendering operation outputs an annotation file, a depth
map, and a text file containing the associated camera intrinsic
and extrinsic parameters. These files are parsed with the code
available from [?]. Since we only need depth and annotations,
the rendering procedure is fast, taking less than one second
per view at VGA resolution on a standard desktop machine.
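To illustrate how a rendering trajectory could be expressed as a sequence of 3x4 extrinsic matrices, the following sketch generates camera poses on a circle around a point of interest. It is only a sketch under stated assumptions: the camera-to-world convention [R | t], the y-up world axis, and the helper names `look_at_pose` and `orbit_trajectory` are hypothetical and do not come from the SynthCam3D code or from POV-Ray; the convention actually expected in the .pov file should be checked against the library.

```python
import math


def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def _normalise(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def look_at_pose(eye, target, up=(0.0, 1.0, 0.0)):
    """Return a 3x4 camera-to-world matrix [R | t] as nested lists.

    Assumed convention: the columns of R are the camera's right, up and
    forward axes in world coordinates, and t is the camera position.
    """
    forward = _normalise(_sub(target, eye))
    right = _normalise(_cross(up, forward))
    true_up = _cross(forward, right)
    return [[right[i], true_up[i], forward[i], eye[i]] for i in range(3)]


def orbit_trajectory(centre, radius, height, n_views):
    """Place n_views cameras on a horizontal circle, all looking at centre."""
    poses = []
    for k in range(n_views):
        theta = 2.0 * math.pi * k / n_views
        eye = [centre[0] + radius * math.cos(theta),
               centre[1] + height,
               centre[2] + radius * math.sin(theta)]
        poses.append(look_at_pose(eye, centre))
    return poses


poses = orbit_trajectory([0.0, 0.0, 0.0], radius=2.0, height=1.0, n_views=8)
for row in poses[0]:
    print(" ".join(f"{x: .4f}" for x in row))
```

Each matrix in `poses` would correspond to one rendered view; writing them into the scene file is left out because the exact syntax depends on the provided camera settings.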

4. System Overview 

Our system relies on a reconstruction front-end running in
real time and a deep learning back-end that takes four input
channels, namely depth, height from ground, angle with gravity
vector, and curvature (DHAC). The labels obtained from
different viewpoints are then fused together with classic
Bayesian filtering [?] on a voxelised volume using the camera
poses returned by the reconstruction system. We observe
immediate benefits of performing 3D mapping and semantic
segmentation in parallel threads: first, at test time, we can
use depth maps raycasted from the mapping volume, which have
superior quality compared to raw depth maps; this results in
improved segmentation. Second, we can improve the overall
segmentation of the scene by label fusion. We briefly describe
reconstruction and our deep learning architecture below.
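The label fusion step can be sketched as a recursive Bayesian update of a per-voxel class distribution. The code below is a minimal illustration, assuming that per-frame network outputs are treated as independent likelihoods and that each pixel has already been associated with a voxel index via the camera pose (that projection is omitted). The `VoxelLabelMap` class and `fuse_label` function are hypothetical names, not part of our system's code.

```python
def fuse_label(prior, frame_probs, eps=1e-6):
    """One Bayesian update: multiply prior by the new frame's class
    probabilities (floored at eps to avoid zeroing a class) and renormalise."""
    posterior = [p * max(q, eps) for p, q in zip(prior, frame_probs)]
    total = sum(posterior)
    return [p / total for p in posterior]


class VoxelLabelMap:
    """Sparse map from voxel index (i, j, k) to a class distribution."""

    def __init__(self, n_classes):
        self.n_classes = n_classes
        self.voxels = {}

    def integrate(self, voxel_index, frame_probs):
        # Unseen voxels start from a uniform prior over the classes.
        uniform = [1.0 / self.n_classes] * self.n_classes
        prior = self.voxels.get(voxel_index, uniform)
        self.voxels[voxel_index] = fuse_label(prior, frame_probs)

    def label(self, voxel_index):
        # Most probable class after all observations so far.
        probs = self.voxels[voxel_index]
        return max(range(self.n_classes), key=probs.__getitem__)


# Five classes; one voxel observed from three viewpoints. The second view
# disagrees with the other two, but the fused label follows the majority.
fused = VoxelLabelMap(n_classes=5)
views = [
    [0.5, 0.3, 0.1, 0.05, 0.05],
    [0.3, 0.5, 0.1, 0.05, 0.05],
    [0.6, 0.2, 0.1, 0.05, 0.05],
]
for probs in views:
    fused.integrate((3, 4, 5), probs)
print(fused.label((3, 4, 5)))
```

The example shows the intended effect of fusion: a single disagreeing view is outweighed once evidence from the other viewpoints is accumulated.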

4.1. Reconstruction 

Our reconstruction system is a custom implementation of the
well-known KinectFusion algorithm [12], wherein depth maps are
averaged in their truncated signed distance representation on
a voxelised 3D volume. For all our segmentation experiments, we
use raycasted depth maps and camera poses obtained via this
system module. Finally, we align the local reference frame of
the reconstruction with the inertial frame, using the simple
and effective optimisation proposed in [?] to obtain the
required rotation matrix. This allows us to compute features
that are invariant to rotation about the gravity axis, i.e.
height from the ground plane and angle with the gravity vector.

1 https://sites.google.com/site/poseray/
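The two gravity-aligned features mentioned in Section 4.1 can be sketched as follows. This is an illustration only, assuming that the alignment rotation R is already known, that gravity points along the negative y-axis of the inertial frame, and that the ground plane sits at a known height; the function names are hypothetical and the optimisation that estimates R is not shown.

```python
import math

# Assumed gravity direction in the inertial frame (unit vector, y is up).
GRAVITY = (0.0, -1.0, 0.0)


def mat_vec(R, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def yaw(theta):
    """Rotation by theta radians about the gravity (y) axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def height_and_angle(point, normal, R, ground_height=0.0):
    """Return (height above ground, angle in degrees between the surface
    normal and the gravity vector) for a point given in the reconstruction
    frame, where R rotates that frame into the inertial frame."""
    p = mat_vec(R, point)
    n = mat_vec(R, normal)
    height = -sum(p[i] * GRAVITY[i] for i in range(3)) - ground_height
    norm = math.sqrt(sum(x * x for x in n))
    cos_a = sum(n[i] * GRAVITY[i] for i in range(3)) / norm
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
    return height, angle


# A point on a horizontal surface 0.75 m above the ground, normal pointing up.
point, normal = [1.0, 0.75, 2.0], [0.0, 1.0, 0.0]
print(height_and_angle(point, normal, yaw(0.0)))
# The same point under an extra rotation about the gravity axis.
print(height_and_angle(point, normal, yaw(0.7)))
```

Both calls give the same height and angle, which is the invariance to rotation about the gravity axis that motivates these features.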

4.2. Segmentation using deep learning 

Our segmentation module is inspired by the deep architecture
used in [?]. It is composed of a sequence of stacked
autoencoders, with supervised modular training of each layer to
capture the representative features of the scene at different
scales and produce dense predictions for each pixel in the
input depth map. We use this architecture primarily due to its
lightweight structure, compared to e.g. [?], which has
prohibitive memory requirements.

We perform preliminary experiments with this network on simple
scenes composed of chairs and tables. In all our experiments,
we segment the scene into 5 different classes: chairs, tables,
floor, ceiling, and wall. Figure 4 shows the segmentation
results on training data, where a clear improvement is evident
as layers are progressively added to the network. Figures 5 and
6 show results on real-world scenes, where we are able to get
good segmentations; the training was done exclusively on
synthetic scenes containing chairs and tables.

Video Links: http://robotvault.bitbucket.org/results.html

5. Conclusion 

We are working towards a real-time system for semantic scene
understanding that combines the strengths of 3D reconstruction
and semantic segmentation. We investigate the possibilities of
using only depth data for this task and we make publicly
available a new library containing the data and the code
necessary to generate high-quality annotations for indoor
scenes. Future work includes expanding the repository with new
synthesised scenes to learn effective models for indoor
semantic segmentation.
Figure 4: Preliminary results of our architecture demonstrate its ability to jointly learn pixel-wise classifiers to produce
a smooth segmentation. From top to bottom, we see how the four different layers of our architecture progressively improve
the labels. Note that these results are on the training set and the colour coding of labels is different.



Figure 5: Real data results on tables and chairs. The first column shows the depth images raycasted from the TSDF volume and
the second column shows the segmentation results.



Figure 6: Left: results on one chair. Right: results on multiple chairs. Note that the training was done on scenes containing 
both chairs and tables. 